Abstract

The Baire metric induces an ultrametric on a dataset and is of linear
computational complexity, contrasted with the standard quadratic time
agglomerative hierarchical clustering algorithm. In this work we evaluate
empirically this new approach to hierarchical clustering. We compare
hierarchical clustering based on the Baire metric with (i) agglomerative
hierarchical clustering, in terms of algorithm properties; (ii) generalized
ultrametrics, in terms of definition; and (iii) fast clustering through
k-means partitioning, in terms of quality of results. For the latter, we
carry out an in-depth astronomical study. We apply the Baire distance
to spectrometric and photometric redshifts from the Sloan Digital Sky
Survey using, in this work, about half a million astronomical objects.
We want to know how well the (more costly to determine) spectrometric
redshifts can predict the (more easily obtained) photometric redshifts, i.e.
we seek to regress the spectrometric on the photometric redshifts, and we
use clusterwise regression for this.

1 Introduction

Our work has quite a range of vantage points, including the following. Firstly,
there is a particular distance between observables, which happens to be also a
"strong" or ultrametric distance. Section 2 defines this. This same section notes
how the encoding of data is quite closely associated with the determining of the
distance.

Next, in section 3.1 we take the vantage point of clusters, and of sets of
clusters.

Finally, in section 4 we wrap up on the hierarchy that is linked to the distance
used, and to the set of clusters.






So we have the following aspects and vantage points: distance, ultrametric,
data encoding, cluster or set (and membership), sets of clusters (and their
interrelationships), and hierarchical clustering. Those aspects and vantage points
are discussed in the first part of this article. They are followed by case studies
and applications in subsequent sections. We have not, in fact, exhausted the
properties and aspects of our new approach. For example, among issues that
we will leave for further in-depth exploration are: p-adic number representation
spaces; and hashing, data retrieval and information obfuscation.

The following presents a general scene-setting where we introduce metric
and ultrametric, we describe some relevant discrete mathematical structures,
and we note some computational properties.

1.1 Agglomerative Hierarchical Clustering Algorithms 

A metric space $(X, d)$ consists of a set $X$ on which is defined a distance function
$d$ which assigns to each pair of points of $X$ a distance between them, and satisfies
the following four axioms for any triplet of points $x, y, z$:

A1: $\forall x, y \in X, \; d(x, y) \geq 0$ (positiveness)
A2: $\forall x, y \in X, \; d(x, y) = 0$ iff $x = y$ (reflexivity)
A3: $\forall x, y \in X, \; d(x, y) = d(y, x)$ (symmetry)
A4: $\forall x, y, z \in X, \; d(x, z) \leq d(x, y) + d(y, z)$ (triangle inequality)

When considering an ultrametric space we need to consider the strong triangular
inequality or ultrametric inequality defined as:

A5: $d(x, z) \leq \max \{ d(x, y), d(y, z) \}$ (ultrametric inequality)

and this in addition to the positivity, reflexivity and symmetry properties
(properties A1, A2, A3) for any triple of points $x, y, z \in X$.
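
The axioms above can be checked mechanically on a small point set. The following is a minimal sketch, not part of the method itself: the function name `is_ultrametric` and the toy distance `toy` are hypothetical, introduced only to contrast a metric that fails A5 with a distance that satisfies it.

```python
from itertools import permutations

def is_ultrametric(d, points, tol=1e-12):
    """Return True if d satisfies A1-A3 and the ultrametric inequality A5 on `points`."""
    # A2 (one direction): every point is at distance 0 from itself.
    for x in points:
        if abs(d(x, x)) > tol:
            return False
    # A1, A2, A3 on distinct points: strictly positive and symmetric.
    for x, y in permutations(points, 2):
        if d(x, y) <= 0 or abs(d(x, y) - d(y, x)) > tol:
            return False
    # A5: d(x, z) <= max(d(x, y), d(y, z)) for every ordered triple.
    for x, y, z in permutations(points, 3):
        if d(x, z) > max(d(x, y), d(y, z)) + tol:
            return False
    return True

# Absolute difference on the line is a metric (A4 holds) but not an ultrametric.
def euclid(x, y):
    return abs(x - y)

# Hypothetical toy distance: 0 for identical values, 0.5 for values with the
# same integer part, 1 otherwise.
def toy(x, y):
    if x == y:
        return 0.0
    return 0.5 if int(x) == int(y) else 1.0

pts = [0.1, 0.4, 1.2, 1.7]
print(is_ultrametric(euclid, pts))  # False
print(is_ultrametric(toy, pts))     # True
```

The check is cubic in the number of points, so it is only meant for inspecting small examples.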

If $X$ is endowed with a metric, then this metric can be mapped onto an
ultrametric. In practice, endowing $X$ with a metric can be relaxed to a
dissimilarity. An often used mapping from metric to ultrametric is by means of an
agglomerative hierarchical clustering algorithm. A succession of $n - 1$ pairwise
merge steps takes place by making use of the closest pair of singletons and/or
clusters at each step. Here $n$ is the number of observations, i.e. the cardinality
of set $X$. Closeness between singletons is furnished by whatever distance or
dissimilarity is in use. For closeness between singleton or non-singleton clusters,
we need to define an inter-cluster distance or dissimilarity. This can be defined
with reference to the cluster compactness or other property that we wish to
optimize at each step of the algorithm. In terms of advising a user or client, such
a cluster criterion, motivating the inter-cluster dissimilarity, is best motivated
in turn by the data analysis application or domain.

Since agglomerative hierarchical clustering requires consideration of pairwise
dissimilarities at each stage it can be shown that even in the case of the most
efficient algorithms, e.g. those based on reciprocal nearest neighbors and nearest
neighbor chains [5D], $O(n^2)$ or quadratic computational time is required. The
innovation in the work we present here is that we carry out hierarchical
clustering in a different way such that $O(n)$ or linear computational time is needed.
As always in computational theory, these are worst case times.

A hierarchy, $H$, is defined as a binary, rooted, node-ranked tree, also termed
a dendrogram [31 [T71 [THl HO]. A hierarchy defines a set of embedded subsets of a
given set of objects $X$, indexed by the set $I$. These subsets are totally ordered
by an index function $\nu$, which is a stronger condition than the partial order
required by the subset relation. A bijection exists between a hierarchy and an
ultrametric space.

Let us show these equivalences between embedded subsets, hierarchy, and
binary tree, through the constructive approach of inducing $H$ on a set $I$.

Hierarchical agglomeration on $n$ observation vectors with indices $i \in I$
involves a series of $1, 2, \ldots, n - 1$ pairwise agglomerations of observations or
clusters, with properties that follow.

In order to simplify notation, let us use the index $i$ to represent also the
observation, and also the observation vector. Hence for $i = 3$ and the third - in
some sequence - observation vector, $x_i = x_3$, we will use $i$ to also represent $x_i$
in such a case.

A hierarchy $H = \{ q \mid q \in 2^I \}$ such that (i) $I \in H$, (ii) $i \in H \; \forall i$, and (iii) for
each $q \in H, q' \in H$: $q \cap q' \neq \emptyset \Longrightarrow q \subset q'$ or $q' \subset q$. Here we have denoted
the power set of set $I$ by $2^I$. An indexed hierarchy is the pair $(H, \nu)$ where the
positive function defined on $H$, i.e., $\nu : H \rightarrow \mathbb{R}^+$, satisfies: (i) $\nu(i) = 0$ if $i \in H$ is
a singleton; and (ii) $q \subset q' \Longrightarrow \nu(q) < \nu(q')$. Here we have denoted the positive
reals, including 0, by $\mathbb{R}^+$. Function $\nu$ is the agglomeration level. Take $q \subset q''$
and $q' \subset q''$, and let $q''$ be the lowest level cluster for which this is true. Then
if we define $D(q, q') = \nu(q'')$, $D$ is an ultrametric.

In practice, we start with a Euclidean or alternative dissimilarity, use some
criterion such as minimizing the change in variance resulting from the
agglomerations, and then define $\nu(q)$ as the dissimilarity associated with the
agglomeration carried out.
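
As an illustration of how an indexed hierarchy yields an ultrametric, the sketch below stores each non-singleton cluster with its agglomeration level and returns, for two observations, the level of the lowest cluster containing both. The list `H`, its levels and the function name `induced_ultrametric` are hypothetical, chosen only for illustration.

```python
def induced_ultrametric(hierarchy, i, j):
    """D(i, j) = level nu(q'') of the lowest cluster q'' containing both i and j."""
    if i == j:
        return 0.0  # singletons have level 0
    return min(level for cluster, level in hierarchy
               if i in cluster and j in cluster)

# Hypothetical indexed hierarchy on I = {1, 2, 3, 4}: (cluster, level nu) pairs.
# Singletons are implicit, with level 0.
H = [
    (frozenset({1, 2}), 0.3),
    (frozenset({3, 4}), 0.5),
    (frozenset({1, 2, 3, 4}), 1.0),
]

print(induced_ultrametric(H, 1, 2))  # 0.3
print(induced_ultrametric(H, 2, 3))  # 1.0
```

Because the levels increase strictly with cluster inclusion, the minimum level over clusters containing both observations is exactly the level of the smallest such cluster.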

2 Baire or Longest Common Prefix Distance 

Agglomerative hierarchical clustering algorithms are constructive hierarchy-constructing
algorithms. Such algorithms have the aim of mapping data into an ultrametric
space, or searching for an ultrametric embedding, or ultrametrization [30].

Now, inherent ultrametricity leads to an identical result with most commonly
used agglomerative criteria [20]. Furthermore, data coding can help greatly
in finding how inherently ultrametric data is [21]. In certain respects the hierarchy
determined by the Baire distance can be viewed as a particular coding of the
data because it seeks longest common prefixes in pairs of (possibly numerical)
strings. We could claim that determining the longest common prefix is a form
of data compression because we can partially express one string in terms of
another.

2.1 Ultrametric Baire Space 

A Baire space consists of countably infinite sequences with a metric defined in
terms of the longest common prefix: the longer the common prefix, the closer a
pair of sequences. What is of interest to us here is this longest common prefix
metric, which we call the Baire distance [26l l6].

Consider real-valued or floating point data (expressed as a string of digits
rather than some other form, e.g. using exponent notation). The longest
common prefixes at issue are those of precision of any value. For example, let us
consider two such values, $x_i$ and $y_j$, with $i$ and $j$ ranging over numeric digits.
When the context easily allows it, we will call these $x$ and $y$.

Without loss of generality we take $x$ and $y$ to be real-valued and bounded
by 0 and 1.

Thus we consider ordered sets $x_k$ and $y_k$ for $k \in K$. In line with our notation,
we can write $x_K$ and $y_K$ for these numbers, with the set $K$ now ordered. So,
$k = 1$ is the first decimal place of precision; $k = 2$ is the second decimal place;
...; $k = |K|$ is the $|K|$th decimal place. The cardinality of the set $K$ is the
precision with which a number, $x_K$, is measured.

Take as examples $x_K = 0.478$; and $y_K = 0.472$. In these cases, $|K| = 3$. Start
from the first decimal position. For $k = 1$, we find $x_k = y_k = 4$. For $k = 2$,
$x_k = y_k$. But for $k = 3$, $x_k \neq y_k$.
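
The digit-by-digit comparison just described can be sketched as follows. This is an illustrative helper only: the name `common_prefix_length` and the choice to pass values as decimal strings are assumptions, made so that no floating point rounding interferes with the digits.

```python
def common_prefix_length(x, y):
    """Number of leading decimal digits shared by two digit strings such as '0.478'."""
    xd = x.split(".", 1)[1]  # digits after the decimal point
    yd = y.split(".", 1)[1]
    k = 0
    for a, b in zip(xd, yd):
        if a != b:
            break
        k += 1
    return k

print(common_prefix_length("0.478", "0.472"))  # 2
```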

We now introduce the following distance (case of vectors $x$ and $y$, with 1
attribute, hence unidimensional):

$$
d_B(x_K, y_K) =
\begin{cases}
1 & \text{if } x_1 \neq y_1 \\
\inf 2^{-k} & x_k = y_k, \quad 1 \leq k \leq |K|
\end{cases}
\qquad (1)
$$

We call this value Baire distance, which is seen to be an ultrametric (STJ [22l
[21 [Ml 121] distance.

Note that the base 2 is given for convenience. When dealing with binary
data $x, y$, then 2 is the chosen base. When working with real numbers the base
can be redefined to 10 if needed.

2.2 Constructive Hierarchical Clustering Algorithm versus Hierarchical Encoding of Data

The Baire distance was introduced and described by Bradley [4] in the context of
inducing a hierarchy on strings over finite alphabets. This work further pursued
the goal of embedding a dendrogram in a p-adic Bruhat-Tits tree, informally
characterized as a "universal dendrogram".

By convention we denote a prime by p, and a more general, prime or non-prime,
positive integer by m.

A geometric foundation for ultrametric structures is presented in Bradley 
[3]. Starting from the point of view that a dendrogram, or ranked or unranked, 
binary or more general m-way, tree, is an object in a p-adic geometry, it is 
noted that: "The consequence of using p-adic methods is the shift of focus 
from imposing a hierarchic structure on data to finding a p-adic encoding which 
reveals the inherent hierarchies." 

This summarizes well our aim in this work. We seek hierarchy and rather
than using an agglomerative hierarchical clustering algorithm which is of quadratic
computational time (i.e., for $n$ individuals or observation vectors, $O(n^2)$
computational time is required) we instead seek to read off a p-adic or m-adic tree.
In terms of a tree, p-adic or m-adic mean p-way or m-way, respectively, or that
each node in the tree has at most p or m, respectively, sub-nodes.

Furthermore, by "reading off" we are targeting a linear time, or $O(n)$,
algorithm involving one scan over the dataset, and we are imposing thereby an
encoding of the data. (We recall that $n$ is the number of observations, or
cardinality of the observation set $X$.)

In practice we will be more interested in this work in the hierarchy, and 
the encoding algorithm used is a means towards this end. For a focus on the 
encoding task, see 

3 The Set of Clusters Perspective 

3.1 The Baire Ultrametric as a Generalized Ultrametric 

While the Baire distance is also an ultrametric, it is interesting to note some links
with other closely related data analysis and computational methods. We can,
for example, show a relationship between the Baire distance and the generalized
ultrametric, which maps the cross-product of a set with itself into the power
set of that set's attributes. A (standard) ultrametric instead maps the
cross-product of a set with itself into the non-negative reals. We pursue this link with
the generalized ultrametric in section 3.1.1.

We also discuss the data analysis method known as Formal Concept Analysis
as a special case of generalized ultrametrics. This is an innovative vantage point
on Formal Concept Analysis because it is usually motivated and described in
terms of lattices, which structure the data to be analyzed. We pursue this link
with Formal Concept Analysis in section 3.1.2.

We note that agglomerative hierarchical clustering, expressed as a 2-way (or
"binary") tree, has been related to lattices by, e.g., Lerman [TH], Janowitz [TC],
and others.

3.1.1 Generalized Ultrametrics 

In this section, our focus is on the clusters determined, and on the relationships
between them. What we pursue is exemplified as follows. Take $x = 0.4578$, $y = 0.4538$.
Consider the Baire distance between $x$ and $y$ as (base 10) $10^{-2}$. Let
us look at the cluster where they share membership: it is the cluster defined
by common first digit precision and common second digit precision. We are
interested in a set of such clusters in this section.
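
The base 10 value quoted above can be reproduced with a short sketch of the Baire distance for decimal strings. The function name `baire_distance` and its `base` parameter are hypothetical. The treatment of fully identical strings, which here receive `base**(-|K|)` rather than 0, is one reading of the definition and is an assumption.

```python
def baire_distance(x, y, base=10):
    """Baire distance between two decimal strings in [0, 1), e.g. '0.4578'.

    Returns 1 if the first decimal digits differ, otherwise base**(-k), where
    k is the length of the longest common prefix of decimal digits.
    Assumption: identical strings of |K| digits yield base**(-|K|), not 0.
    """
    xd = x.split(".", 1)[1]
    yd = y.split(".", 1)[1]
    k = 0
    for a, b in zip(xd, yd):
        if a != b:
            break
        k += 1
    return 1.0 if k == 0 else float(base) ** (-k)

print(baire_distance("0.4578", "0.4538"))        # 0.01
print(baire_distance("0.478", "0.472", base=2))  # 0.25
```

Both values share the prefix 45, so they fall in the same cluster at the first and at the second level of precision.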

The usual ultrametric is an ultrametric distance, i.e. for a set $I$, $d : I \times I \rightarrow \mathbb{R}^+$.
Thus, the ultrametric distance is a positive real.

The generalized ultrametric is also consistent with this definition, where the
range is a subset of the power set: $d : I \times I \rightarrow \Gamma$, where $\Gamma$ is a partially ordered
set with least element. See [M]. The least element is a generalized way of seeing
zero distance. Some areas of application of generalized ultrametrics will now be
discussed.

Among other fields, generalized ultrametrics are used in reasoning. In the 
theory of reasoning, a monotonic operator is rigorous application of a succession 
of conditionals (sometimes called consequence relations). However negation 
or multiple valued logic (i.e. encompassing intermediate truth and falsehood) 
requires support for non-monotonic reasoning, where fixed points are modeled 
as tree structures. See [11] . 

A direct application of generalized ultrametrics to data mining is the
following. The potentially huge advantage of the generalized ultrametric is that
it allows a hierarchy to be read directly off the $I \times J$ input data, and bypasses
the $O(n^2)$ consideration of all pairwise distances in agglomerative hierarchical
clustering. Let us assume that the hierarchy is induced on the observation set,
$I$, whose elements are typically given by the rows of the input data matrix. In [26] we
study application to chemoinformatics. Proximity and best match finding is
an essential operation in this field. Typically we have one million chemicals
upwards, characterised by an approximate 1000-valued attribute encoding. The
set of attributes is $J$, and the number of attributes is the cardinality of this set,
$|J|$.

Consider first our need to normalize the data. We divide each boolean 
(presence/absence) attribute value by its corresponding column sum. 

We can consider the hierarchical cluster analysis from abstract posets as
based on a distance or even dissimilarity $d : I \times I \rightarrow \mathbb{R}^{|J|}$. The $|J|$-dimensional
reals are the range here.

As noted in section 1 we can consider embedded clusters corresponding to
the minimal Baire distance (in definition (1) this is seen to be 0.5).
The Baire distance induces the hierarchical clustering, and this hierarchical
clustering is determined from the Baire distances. So it is seen how the Baire
distance maps onto real valued numbers (cf. definition (1)) and as such is a
metric. But the Baire distance also maps onto a hierarchical clustering, i.e. a
partially ordered set of clusters, and so, in carrying out this mapping, the Baire
distance gives rise to a generalized ultrametric.

Our Baire-based distance and simultaneously ultrametric is a particular case 
of the generalized ultrametric. 

Figures 5 and 6, to be studied below in section 6.1, show a set of results,
related to the range set, $\mathbb{R}^{|J|}$, which are, in practice, further processed in order
to provide the cluster memberships.
3.1.2 Link with Formal Concept Analysis

We pursue the case of an ultrametric defined on the power set or join semilattice.
Comprehensive background on ordered sets and lattices can be found in [TU]. A
review of generalized distances and ultrametrics can be found in [29].

Typically hierarchical clustering is based on a distance (which can be relaxed
often to a dissimilarity, not respecting the triangular inequality, and mutatis
mutandis to a similarity), defined on all pairs of the object set: $d : X \times X \rightarrow \mathbb{R}^+$.
I.e., a distance is a positive real value. Usually we require that a distance cannot
be 0-valued unless the objects are identical. That is the traditional approach.

A different form of ultrametrisation is achieved from a dissimilarity defined
on the power set of attributes characterising the observations (objects,
individuals, etc.) $X$. Here we have: $d : X \times X \rightarrow 2^J$, where $J$ indexes the attribute
(variables, characteristics, properties, etc.) set.

This gives rise to a different notion of distance, that maps pairs of objects
onto elements of a join semilattice. The latter can represent all subsets of the
attribute set, $J$. That is to say, it can represent the power set, commonly
denoted $2^J$, of $J$.

As an example, consider, say, $n = 5$ objects characterised by 3 boolean
(presence/absence) attributes, shown in Figure 1 (top). Define dissimilarity
between a pair of objects in this table as a set of 3 components, corresponding
to the 3 attributes, such that if both components are 0, we have 1; if either
component is 1 and the other 0, we have 1; and if both components are 1 we
get 0. This is the simple matching coefficient. We could use, e.g., Euclidean
distance for each of the values sought; but here instead we treat 0 values in
both components as signalling a 1 contribution (hence, 1 is a data encoding of
a property rather than its absence). We get then $d(a, b) = 1,1,0$ which we will
call d1,d2. Then, $d(a, c) = 0,1,0$ which we will call d2. Etc. With the latter,
d1,d2 here, d2, and so on, we create lattice nodes as shown in the middle part
of Figure 1. So, note in this figure how the order relation holds between d1,d2
at level 2 and d2 at level 1.

In Formal Concept Analysis [101 [H], it is the lattice itself which is of primary
interest. In [16] there is discussion of, and a range of examples on, the close
relationship between the traditional hierarchical cluster analysis based on
$d : I \times I \rightarrow \mathbb{R}^+$, and hierarchical cluster analysis "based on abstract posets" (a
poset is a partially ordered set), based on $d : I \times I \rightarrow 2^J$. The latter, leading to
clustering based on dissimilarities, was developed initially in [15].

Thus, in Figure 1, we have $d(a, b) \rightarrow$ d1,d2; $d(a, f) \rightarrow$ d1,d2; $d(a, e) \rightarrow$
d2,d3; and so on. We note how the d1,d2 etc. are sets that are subsets of the
power set of attributes, $2^J$.
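
The set-valued dissimilarity described above can be sketched as follows. The object rows used here are hypothetical and are NOT the values of Figure 1. The function name `attribute_dissimilarity` is likewise an assumption. An attribute contributes to the dissimilarity unless it is present (value 1) in both objects, and the lattice level is the size of the resulting set.

```python
def attribute_dissimilarity(u, v, names):
    """Set of attribute names contributing 1 to the dissimilarity:
    every attribute except those present (value 1) in both objects."""
    return frozenset(n for n, a, b in zip(names, u, v)
                     if not (a == 1 and b == 1))

names = ("d1", "d2", "d3")
# Hypothetical boolean rows, for illustration only.
objects = {"p": (1, 0, 1), "q": (0, 0, 1), "r": (1, 1, 1)}

d_pq = attribute_dissimilarity(objects["p"], objects["q"], names)
d_pr = attribute_dissimilarity(objects["p"], objects["r"], names)
print(sorted(d_pq), len(d_pq))  # ['d1', 'd2'] 2
print(sorted(d_pr), len(d_pr))  # ['d2'] 1
print(d_pr <= d_pq)             # True: d2 lies below d1,d2 in the lattice
```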
[Data table: objects a, b, c, e, f against 3 boolean attributes; cell values not recoverable.]

Potential lattice vertices      Lattice vertices found      Level

        d1,d2,d3                       d1,d2,d3               3
                                        /    \
  d1,d2   d2,d3   d1,d3            d1,d2    d2,d3             2
                                        \    /
    d1      d2      d3                    d2                  1

The set d1,d2,d3 corresponds to: $d(b, e)$ and $d(e, f)$
The subset d1,d2 corresponds to: $d(a, b)$, $d(a, f)$, $d(b, c)$, $d(b, f)$, and $d(c, f)$
The subset d2,d3 corresponds to: $d(a, e)$ and $d(c, e)$
The subset d2 corresponds to: $d(a, c)$

Clusters defined by all pairwise linkage at level $\leq 2$:
a, b, c, f
a, c, e

Clusters defined by all pairwise linkage at level $\leq 3$:
a, c, e, f

Figure 1: Top: example data set consisting of 5 objects, characterized by 3
boolean attributes. Then: lattice corresponding to this data and its
interpretation.

4 A Baire-Based Hierarchical Clustering Algorithm



We have discussed Formal Concept Analysis as a particular case of the use 
of generalized ultrametrics. We noted that a nice feature of the generalized 
ultrametric is that it may allow us to directly "read off" a hierarchy. That in 
turn, depending of course on the preprocessing steps needed or other properties 
of the algorithm, may be computationally very efficient. 

Furthermore, returning further back to section 2.1 we note that the
ultrametric Baire space can be viewed in a generalized ultrametric way. We can view
the output mapping as being a restricted subset of the power set of the set $K$ of
digits of precision. Alternatively expressed, the output mapping is a restricted
subset of the power set, $2^K$. Why restricted? Because we are only interested
in a longest common prefix sequence of identical digits, and not in the sharing
of any arbitrary precision digits.

A straightforward algorithm for hierarchical clustering based on the Baire
distance, as described in section 2.1, is as follows. Because of working with real
numbers in our case study below, we define the base in relation (1) as 10 rather
than 2.

For the first digit of precision, $k = 1$, consider 10 "bins" corresponding
to the digits 0, 1, ..., 9. For each of the nodes corresponding to these bins,
consider 10 subnode bins corresponding to the second digit of precision, $k = 2$,
associated with 0, 1, ..., 9 at this second level. We can continue for a third and
further levels. In practice we will neither permit nor wish for a very deep (i.e.,
with many levels) storage tree. For the base 10 case, it is convenient for level
one (corresponding to $k = 1$) to give rise to up to 10 clusters. For level two
(corresponding to $k = 2$) we have up to 100 clusters. We see that in practice
a small number of levels will suffice. In one pass over the data we map each
observation (recall that it is univariate but we are using its ordered set of digits,
i.e. ordered set $K$) to its bin or cluster at each level. For $\ell$ levels, the computation
required is $n \cdot \ell$ operations. For a given value of $\ell$ we therefore have $O(n)$
computation, and furthermore with a very small constant of proportionality
since we are just reading off the relevant digit and, presumably, updating a node
or cluster membership list and cardinality.
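
The one-pass binning just described can be sketched as below. This is a minimal illustration under stated assumptions rather than the implementation used in the case study: values are passed as decimal strings in [0, 1), short values are padded with zeros, and the names `baire_hierarchy` and `data` are hypothetical.

```python
from collections import defaultdict

def baire_hierarchy(values, levels=3):
    """Map each value to its bin at every level 1..levels in a single pass.

    The bin at level k is labelled by the first k decimal digits, so each
    observation triggers `levels` bin assignments: n * levels in total.
    """
    tree = [defaultdict(list) for _ in range(levels)]
    for idx, v in enumerate(values):
        # Assumption: values shorter than `levels` digits are zero-padded.
        digits = v.split(".", 1)[1].ljust(levels, "0")
        for k in range(1, levels + 1):
            tree[k - 1][digits[:k]].append(idx)
    return tree

data = ["0.478", "0.472", "0.413", "0.920"]  # hypothetical sample
tree = baire_hierarchy(data, levels=3)
print(dict(tree[0]))  # {'4': [0, 1, 2], '9': [3]}
print(dict(tree[1]))  # {'47': [0, 1], '41': [2], '92': [3]}
```

Each level refines the previous one, so the list of dictionaries is itself the hierarchy: a bin at level k is always contained in the bin labelled by its first k - 1 digits.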



5 Astronomical Case Study 
5.1 The Sloan Digital Sky Survey 

The Sloan Digital Sky Survey (SDSS) [28] is systematically mapping the sky,
producing a detailed image of it and determining the positions and absolute
brightnesses of more than 100 million celestial objects. It is also measuring the
distance to a million of the nearest galaxies and to a hundred thousand quasars.
The acquired data has been openly given to the scientific community.

Figure 2 depicts the SDSS Data Release 5 for imaging and spectral data.
For every object a large number of attributes and measurements are acquired.
See [T] for a description of the data available in this catalog.



Figure 2: Distribution in the sky of the SDSS Data Release 5

In particular we use the data that has been studied by the Longo group TO] and
used intensively by Longo and D'Abrusco [71IH1IS].

5.2 Doppler Effect and Redshift 

Light from moving objects will appear to have different wavelengths depending
on the relative motion of the source and the observer. If an object is moving
towards an observer, the light waves are compressed from the observer's
viewpoint, so the light is shifted to a shorter wavelength, i.e. it appears blue
shifted. If instead the object is moving away from the observer, the light
wavelength is stretched, and the light appears red shifted. This is called the
Doppler effect (or Doppler shift), named after the Austrian physicist Christian
Doppler, who first described this phenomenon in 1842. A very important piece of
information obtained in cosmology from the Doppler shift is whether an object is
moving towards or away from us, and the speed at which this is happening.

Spectrometric measurement of redshift: under certain conditions all
atoms can be made to emit light, doing so at particular wavelengths, which
can be measured accurately. Chemical compounds are a combination of
different atoms working together. Thus, when measuring the precise wavelength
at which a particular chemical radiates we are effectively obtaining a signature
of this chemical. These emissions are seen as lines (emission or absorption) in
the electromagnetic spectrum. For example, hydrogen is the simplest chemical
element, with atomic number 1, and is also the most abundant chemical in the
universe. Hydrogen has emission lines at 6562.8 Å, 4861.3 Å, 4340 Å, 4102.8 Å,
3888.7 Å, 3834.7 Å and 3798.6 Å (where Å is an Angstrom, equal to
$10^{-10}$ m). If the spectrum of a celestial body has emission lines at these
wavelengths we can conclude that hydrogen is present there.
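
Once a line has been identified, the redshift follows from comparing its observed and rest wavelengths using the standard definition $z = (\lambda_{obs} - \lambda_{rest}) / \lambda_{rest}$. The sketch below is only an illustration: the observed wavelength is a hypothetical value, and the function name `redshift` is an assumption.

```python
H_ALPHA_REST = 6562.8  # Angstrom, first hydrogen line quoted in the text

def redshift(observed, rest=H_ALPHA_REST):
    """Redshift z = (lambda_observed - lambda_rest) / lambda_rest."""
    return (observed - rest) / rest

# Hypothetical observed wavelength, for illustration only.
z = redshift(7219.08)
print(round(z, 3))  # 0.1
```

A positive value indicates a red shifted line, a negative value a blue shifted one.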

Photometric measurement of redshift: sometimes obtaining
spectrometric measurements can be very difficult due to the large number of objects
to observe or because the signal is too weak for the current spectrometric
techniques. A redshift estimate can be obtained using large/medium band
photometry instrumentation instead of spectrometric. This technique is based on the
identification of strong spectral features. This is much faster than spectrometric
measurement but also of lesser quality and precision [11].

Hence the context of our clustering work is to see how well the more easily 
obtained photometric redshifts can be used as estimates for the spectrometric 
redshifts that are obtained with greater cost. We limit our work here to the fast 
finding of clusters of associated photometric and spectrometric redshifts. In 
doing so, we find some interesting new ways of finding good quality mappings 
from photometric to spectrometric redshifts with high confidence. 

6 Inducing a Hierarchy on the SDSS Data using 
the Baire Ultrametric 

The aim here is to build a mapping from $z_{spec} \rightarrow z_{phot}$ to help calibrate
the redshifts, based on the $z_{spec}$ observed values. Traditionally we could map
$f : z_{phot} \rightarrow z_{spec}$ based on trained data. That is to say, having set up the
calibration, we determine the higher quality information from the more
readily available less high quality information. The mapping $f$ could be linear
(e.g. linear regression) or non-linear (e.g. multilayer perceptron) as used by
D'Abrusco [9]. These techniques are global. Here our interest is to develop
a locally adaptive approach based on numerical precision. That is the direct
benefit of the (very fast, hierarchical) clustering based on the Baire distance.

We look specifically into four parameters: right ascension (RA), declination
(DEC), spectrometric ($z_{spec}$) and photometric ($z_{phot}$) redshift. Table 1 shows a
small subset of the data used for experimentation and analysis.
As already noted the spectrometric technique uses the spectrum of
electromagnetic radiation (including visible light) which radiates from stars and other
celestial objects. The photometric technique uses a faster and more economical way
of measuring the redshifts.



RA          DEC          Spec         Phot
145.4339    0.56416792   0.14611299   0.15175095
145.42139   0.53370196   0.145909     0.17476539
145.6607    0.63385916   0.46691701   0.41157582
145.64568   0.50961215   0.15610801   0.18679948

Table 1: Data format for right ascension, declination, $z_{spec}$ and $z_{phot}$.



6.1 Clustering SDSS Data 

We use clustering to support a nearest neighbor regression. Hence we are
interested in the matching up to some level of precision between pairs of $z_{spec}$ and
$z_{phot}$ values that are assigned to the same cluster.

In order to perform the clustering process introduced in section 2.1 and
further described in section 4 we compare every $z_{spec}$ and $z_{phot}$ data point searching for
common prefixes based on the longest common prefix (see section 2.1).
Thereafter, the data points that have digit coincidences are grouped together to form
clusters.
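
The comparison of each $z_{spec}$ value with its $z_{phot}$ counterpart can be sketched as follows, using the four sample rows of Table 1. The function name `shared_decimal_digits` and the use of strings (to avoid floating point rounding of the digits) are assumptions. The convention of returning -1 when even the integer parts differ is also hypothetical.

```python
from collections import Counter

def shared_decimal_digits(a, b):
    """Number of leading decimal digits shared by two decimal strings.
    Returns -1 if even the integer parts differ."""
    ia, da = a.split(".", 1)
    ib, db = b.split(".", 1)
    if ia != ib:
        return -1
    k = 0
    for x, y in zip(da, db):
        if x != y:
            break
        k += 1
    return k

# (z_spec, z_phot) pairs from the four sample rows of Table 1.
pairs = [
    ("0.14611299", "0.15175095"),
    ("0.145909", "0.17476539"),
    ("0.46691701", "0.41157582"),
    ("0.15610801", "0.18679948"),
]
counts = Counter(shared_decimal_digits(s, p) for s, p in pairs)
print(counts)  # Counter({1: 4})
```

In this small sample every pair agrees on the integer part and the first decimal digit only, so all four fall in the "one decimal digit" group.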

Data characterisation is presented in Figure 3. The left panel shows the
sky coordinates of the $z_{spec}$ and $z_{phot}$ data currently used by us to cluster
redshifts. This section of the sky presents approximately 0.5 million object
coordinate points with the current data. As can be observed, various sections of
the sky are represented in the data. We find this useful since preliminary data
exploration has shown that correlation between $z_{spec}$ and $z_{phot}$ is consistent in
different parts of the sky. For example, when taking correspondences between
$z_{spec}$ and $z_{phot}$ as shown in Figures 5 and 6 and plotting them in RA and DEC
space (i.e. astronomical coordinate space) we have the same shape as presented
in Figure 3.

This leads us to conclude that digit coincidences of the redshift measures
are distributed approximately uniformly in the sky and are not concentrated
spatially. The same occurs for all the other clusters. We will concentrate on
the very near astronomical objects, represented by redshifts between 0 and 0.6.
When we plot $z_{spec}$ versus $z_{phot}$ we obtain a highly correlated signal as shown
in Figure 3, right panel. The number of observations that we therefore analyse
is 443,094.
[Figure 3, panels: a) RA vs. DEC; b) Spectrometric vs. Photometric.]

Figure 3: Left: right ascension (RA) versus declination (DEC); Right: $z_{spec}$
versus $z_{phot}$. SDSS data selection used for redshift analysis.

Figure 4: Heat plot and histogram for $z_{spec}$ versus $z_{phot}$. Histogram at the top
shows the $z_{spec}$ frequencies, histogram at the right shows $z_{phot}$ frequencies.

Looking at Figure 4 it can be seen clearly that most data points fall in
the range between 0 and 0.2. Here the histogram on the top shows the $z_{spec}$
data points distribution, and the histogram on the right the $z_{phot}$ data points
distribution. The heat plot also highlights the area where data points are
concentrated, where the yellow colour (white region in monochrome print) shows
the major density.

Consequently, now we know that most cluster data points will fall within
this range (0 and 0.2) if common prefixes of digits in the redshift values, taken
as strings, are found.

Figures 5 and 6 show graphically how z_spec and z_phot correspondences look
at different levels of decimal precision. On the one hand we find that values of
z_spec and z_phot that agree up to the 3rd decimal digit are highly
correlated. On the other hand, when z_spec and z_phot have only the first digit
in common, the correlation is weak. For example, let us consider the following
situations for the plots in Figures 5 and 6:

• Figure 5 left: let us take the values z_spec = 0.437 and z_phot = 0.437.
They share the first digit, the first decimal digit, the second
decimal digit, and the third decimal digit. Thus, we have a highly
correlated signal of the data points that share only up to the third decimal
digit.

• Figure 5 right: let us take the values z_spec = 0.437 and z_phot = 0.439.
They share the first digit, the first decimal digit, and the
second decimal digit. Therefore, the plot shows data points that share
only up to the second decimal digit.

• Figure 6 left: let us take the values z_spec = 0.437 and z_phot = 0.474.
They share the first digit and the first decimal digit. Thus,
the plot shows data points that share only up to the first decimal digit.

• Figure 6 right: let us take the values z_spec = 0.437 and z_phot = 0.571.
They share only the integer part of the value, and that
alone. Furthermore, this implies redshifts that do not match in succession
of decimal digits. For example, if we take the values 0.437 and 0.577, the
fact that the third decimal digit is 7 in each case is not of use.
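
The digit comparisons in the four cases above can be sketched in a few lines of Python. This is only an illustration, not the implementation used for the study: the helper names `digit_string` and `shared_prefix_length` are hypothetical, and the redshifts are assumed to be given to three decimal places.

```python
def digit_string(z, decimals=3):
    """Write a redshift as a fixed-width digit string without the decimal point.

    Assumption: z has at most `decimals` decimal places, so the formatting
    below does not round any digit away (0.437 becomes "0437").
    """
    return f"{z:.{decimals}f}".replace(".", "")


def shared_prefix_length(a, b, decimals=3):
    """Count the leading digits (integer digit included) that a and b share."""
    count = 0
    for x, y in zip(digit_string(a, decimals), digit_string(b, decimals)):
        if x != y:
            break  # later coincidences do not count once the prefix is broken
        count += 1
    return count


# The example pairs from the text, as (z_spec, z_phot).
pairs = [(0.437, 0.437), (0.437, 0.439), (0.437, 0.474),
         (0.437, 0.571), (0.437, 0.577)]
for z_spec, z_phot in pairs:
    print(z_spec, z_phot, shared_prefix_length(z_spec, z_phot))
```

For the pair 0.437 and 0.577 the function stops at the first decimal digit, so the coincidence in the third decimal digit is ignored, as described above.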
[Figure 5 panels: a) Third decimal digit; b) Second decimal digit. Horizontal
axis: Spectrometric.]

Figure 5: Prefix-wise clustering frequencies depicting 3rd decimal digit
coincidences (left panel), and two decimal digit coincidences (right panel).



[Figure 6 panels: a) First decimal digit; b) First digit. Horizontal axis:
Spectrometric.]

Figure 6: Prefix-wise clustering frequencies depicting the 1st decimal digit
coincidences (left panel), and first digit coincidences (right panel).
Table 2 (see also Figure 7) shows the clusters found for all different levels of
precision. In other words, this table allows us to define empirically the
confidence levels for the mapping of z_phot and z_spec. For example, we can
expect that 82.8% of values for z_spec and z_phot have at least two common
prefix digits. This percentage of confidence is derived as follows: the data
points that share six, five, four, three, two, and one decimal digit number
4 + 90 + 912 + 8,982 + 85,999 + 270,920 = 366,907, i.e. 82.8% of the data.
Additionally we observe that around a fifth of the observations share at least
3 digits in common: 4 + 90 + 912 + 8,982 + 85,999 = 95,987 data points, which
equals 21.7% of the data.
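
As a check on this arithmetic, the cumulative counts can be recomputed from the Table 2 figures. The sketch below is illustrative only; the dictionary layout and the function name `share_at_least` are assumptions made for the example. Note that "two common prefix digits" means the integer digit plus one decimal digit.

```python
# Counts from Table 2, keyed by the number of shared decimal digits.
# Key 0 stands for pairs that share only the integer (first) digit.
counts = {0: 76187, 1: 270920, 2: 85999, 3: 8982, 4: 912, 5: 90, 6: 4}
total = sum(counts.values())


def share_at_least(decimal_digits):
    """Number and percentage of pairs sharing at least this many decimal digits."""
    n = sum(c for d, c in counts.items() if d >= decimal_digits)
    return n, 100.0 * n / total


for d in (1, 2):
    n, pct = share_at_least(d)
    # d decimal digits correspond to d + 1 prefix digits (integer digit included)
    print(f"at least {d + 1} prefix digits: {n} points, {pct:.1f}%")
```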



Digit              No.       %
1               76,187   17.19

Decimal digit      No.       %
1              270,920   61.14
2               85,999   19.40
3                8,982    2.07
4                  912    0.20
5                   90    0.02
6                    4
Total          443,094  100

Table 2: Data points based on the longest common prefix for different levels
of precision. This includes the integer part of a data point (first digit) and the
decimal digits of a data point (first to sixth digit).




[Figure 7 horizontal axis: Digit, positions 1 to 7.]

Figure 7: Frequency distribution for Table 2. The abscissa shows the digit
positions, where 1 is the first digit, 2 the first decimal digit, 3 the second
decimal digit and so on.
7 Comparative Evaluation with k-Means

In this section we compare the Baire-based clustering to results obtained with
the widely-used k-means clustering algorithm.

7.1 Baire-Based Clustering and k-Means Cluster Comparison

In order to establish how "good" the Baire clusters are we can compare them
with clusters resulting from the k-means algorithm. Let us recall that our data
values are in the interval [0, 0.6[ (i.e. including zero values but excluding 0.6).
Additionally, we have seen that the Baire distance is an ultrametric that is
strictly defined in a tree. Thus, when building the Baire-based clusters we will
have a root node "0" that includes all the observations (every single data point
analysed starts with 0). For the Baire distance with exponent -2 we have six
nodes (or clusters) with indices "00, 01, 02, 03, 04, 05". For the Baire distance
of exponent -3 we have 60 clusters with indices "000, 001, 002, 003, 004, ..., 059"
(i.e. ten children for each node 00, ..., 05). (Cf. how this adapts the discussion
in section 4 in a natural way to our data.)
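
The node indices listed above can be enumerated mechanically. The following is a minimal sketch under the assumptions stated in the text (data in [0, 0.6[, base 10); the function name `node_labels` and its parameters are hypothetical.

```python
from itertools import product


def node_labels(depth, max_first_decimal=5):
    """Labels of all potential Baire nodes at a given depth for data in [0, 0.6[.

    `depth` counts digits including the integer digit, so depth=1 is the
    root node "0", depth=2 gives "00".."05", depth=3 gives "000".."059".
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if depth == 1:
        return ["0"]
    first = [str(d) for d in range(max_first_decimal + 1)]  # first decimal: 0..5
    rest = product("0123456789", repeat=depth - 2)          # remaining decimals
    return ["0" + f + "".join(r) for f, r in product(first, rest)]


print(node_labels(2))
print(len(node_labels(3)))
```

Only the first decimal digit is restricted by the data range; every further digit contributes ten children per node.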

We carried out a number of comparisons for the Baire distance with two and
three digits of precision. For example, by design we have that for d_B = 10^-2
there are six clusters. Thus we took our data set and applied k-means with six
centroids, based on an implementation of the Hartigan and Wong [13] algorithm.
Euclidean distance is used, as usual, here. The results can be seen in Table 3,
where the columns are the k-means clusters and the rows are the Baire clusters.
From the Baire perspective we see that node 00 has 97084 data points contained
within the first k-means cluster and 64950 observations in the fifth. Looking at
node 04, all members belong to the third cluster of k-means. We can see that the
Baire clusters are closely related to the clusters produced by k-means at a
given level of resolution.
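
A cross-tabulation of this kind can be sketched as follows. This is an illustrative sketch only: the k-means labels are assumed to come from some external k-means run, the values are assumed to be stored to six decimal places, and the names `baire_node` and `cross_tabulate` as well as the toy data are hypothetical.

```python
from collections import Counter


def baire_node(z, depth=2, decimals=6):
    """Label of the Baire node: the first `depth` digits of z (integer digit included).

    Assumption: z has at most `decimals` decimal places, so fixed-point
    formatting does not alter any leading digit.
    """
    return f"{z:.{decimals}f}".replace(".", "")[:depth]


def cross_tabulate(values, kmeans_labels, depth=2):
    """Count data points for each (Baire node, k-means cluster) pair."""
    table = Counter()
    for z, label in zip(values, kmeans_labels):
        table[(baire_node(z, depth), label)] += 1
    return table


# Toy data; these k-means labels are made up for illustration only.
values = [0.012, 0.034, 0.081, 0.120, 0.150, 0.410]
kmeans_labels = [1, 1, 5, 5, 4, 3]
table = cross_tabulate(values, kmeans_labels)
for (node, cluster), n in sorted(table.items()):
    print(node, cluster, n)
```

Each non-zero entry of the resulting counter corresponds to one filled cell of a table laid out like Table 3.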





Node         1       5       4       6       2       3
00       97084   64950
01               28382  101433   14878
02                               18184    4459
03                                       25309    1132
04                                               11116
05

Table 3: Cluster comparison based on d_B = 10^-2. Columns show the k-means
clusters, and the rows show the Baire clusters. The cells present the number of
data points for a given cluster.

We can take this procedure further and compare the clusters for d_B defined
from 3 digits of precision, and k-means with k = 60 centroids, as observed in
Figure 8.

Looking at the results from the Baire perspective we find that 27 clusters
are overlapping, 9 clusters are empty, and 24 Baire clusters are completely
within the boundaries of the ones produced by k-means, as presented in Table 6.
This last result is better seen in Table 4, which is the subset of Table 6 (see
Appendix A) where complete matches are shown. These tables have been row
and column permuted in order to clearly appreciate the correspondences.

It is seen that the match is consistent even if there are differences due to the
different clustering criteria at issue. We have presented results in such a way as
to show both consistency and difference.
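
The three categories used above (complete match, overlapping, empty) can be read off a cross-tabulation mechanically. The sketch below assumes the table is held as a dictionary mapping (Baire node, k-means cluster) pairs to counts; the function name `classify_nodes` and the toy counts are hypothetical and are not the values of Table 6.

```python
def classify_nodes(table, all_nodes):
    """Split Baire nodes into complete matches, overlapping nodes and empty nodes.

    `table` maps (baire_node, kmeans_cluster) -> number of data points.
    A node is a complete match if all its points fall in one k-means cluster,
    overlapping if they are spread over several, and empty if it has none.
    """
    clusters_per_node = {node: set() for node in all_nodes}
    for (node, cluster), count in table.items():
        if count > 0:
            clusters_per_node.setdefault(node, set()).add(cluster)
    complete = sorted(n for n, c in clusters_per_node.items() if len(c) == 1)
    overlapping = sorted(n for n, c in clusters_per_node.items() if len(c) > 1)
    empty = sorted(n for n, c in clusters_per_node.items() if not c)
    return complete, overlapping, empty


# Toy cross-tabulation, for illustration only.
toy_table = {("000", 38): 10, ("001", 38): 4, ("001", 25): 6, ("003", 58): 2}
toy_nodes = ["000", "001", "002", "003"]
complete, overlapping, empty = classify_nodes(toy_table, toy_nodes)
print(complete, overlapping, empty)
```

Two Baire nodes may be complete matches for the same k-means cluster, which is why the classification is done per row rather than per cell.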
k-means clusters (columns): 38  25  58  32  20  15  13  14  37  17  51

Baire node   Data points
015          3733
             3495
             2161
             1370
             968
             896
             764
             652
026          555
027          464
032          484
             430
045          398
044          295
039          278
             260
041          231
             225
             350
             57
049          5
050          1

Table 4: Subset of cluster comparison based on d_B = 10^-3; columns show the
k-means clusters (k = 60); rows show Baire nodes.
[Figure 8 horizontal axis: Spectroscopic, 0.0 to 0.6.]

Figure 8: k-means clustering for k = 60 after 38 iterations. Note that
non-contiguous groups may be colored the same.



7.2 Baire and k-Means Clustering Time Comparison

In order to compare the time performances of the Baire and k-means algorithms
we took d_B = 10^-3 as a basis for the test. Let us remember that for d_B = 10^-3
we have potentially 60 clusters for the data in the range [0, 0.6[. Looking at
the classification from the hierarchical tree viewpoint we have: one cluster for
the first level (i.e., the root node or first digit); six clusters for the second
level (i.e., first decimal digit 0, 1, 2, 3, 4, or 5); and ten clusters for the
third level or second decimal digit. To obtain the potential number of clusters
we multiply the potential nodes for the first, second and third levels of the
tree. That is, 1 · 6 · 10 = 60 clusters.

Therefore for the time comparison we have d_B = 10^-3 with 60 clusters, which
is the parameter given to k-means as the initial number of centroids. The other
parameter needed is the number of iterations. For k-means we are interested in
the average time over many runs. Thus, we use the average time over 50 executions
for each iteration count of 1, 5, 10, 15, 20, 25, 30, 35, and 38.

The results can be observed in Figure 9. It is clear that the time in k-means
is linear with respect to the number of iterations (this is well understood in the
k-means literature). In this particular case the algorithm converges at around
iteration 38. Note that these executions are based on different random
initialisations. The times for the k-means algorithm were obtained with the R
statistical software. These times were faster than the times obtained by the
algorithm implemented in Java.



Iteration   Average time (s)
1            6.81
5           12.44
10          22.35
15          32.30
20          42.07
25          51.90
30          61.94
35          71.85
38          77.53

Table 5: Time average for k-means algorithm over 50 executions for each total
iteration count.
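
The linear dependence of k-means time on the iteration count can be checked with an ordinary least-squares line through the Table 5 values. This is a minimal sketch using only the figures reported in the table; the variable names are chosen for the example.

```python
# Iteration counts and average times (seconds) as reported in Table 5.
iterations = [1, 5, 10, 15, 20, 25, 30, 35, 38]
avg_time = [6.81, 12.44, 22.35, 32.30, 42.07, 51.90, 61.94, 71.85, 77.53]

n = len(iterations)
mean_x = sum(iterations) / n
mean_y = sum(avg_time) / n

# Ordinary least squares for y = intercept + slope * x.
sxx = sum((x - mean_x) ** 2 for x in iterations)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(iterations, avg_time))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Largest absolute deviation of the table values from the fitted line.
max_residual = max(abs(y - (intercept + slope * x))
                   for x, y in zip(iterations, avg_time))

print(f"seconds per iteration: {slope:.2f}")
print(f"fixed cost (intercept): {intercept:.2f}")
print(f"largest residual: {max_residual:.2f}")
```

On these figures the slope comes out a little under two seconds per additional iteration, and the residuals are small compared with the measured times.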



[Figure 9 horizontal axis: Iteration.]

Figure 9: k-means average processing time in seconds for k = 60. Averages
are obtained for 9 examples with 50 executions each.



The Baire method only needs one pass over the data to produce the clusters.
Regarding the time needed, we tested a Java implementation of the Baire
algorithm. We ran 50 experiments over the SDSS data. It took on average 2.9
seconds. Compare this to Table 5.
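
The single pass can be illustrated with a short sketch. This is not the Java implementation that was timed; it is a hypothetical Python version, assuming values stored to six decimal places, and the name `baire_clusters` is chosen for the example.

```python
from collections import defaultdict


def baire_clusters(values, depth=3, decimals=6):
    """Group values by their first `depth` digits in a single pass over the data.

    `depth` counts the integer digit, so depth=3 corresponds to d_B = 10^-3.
    Each value is visited once and dropped into the bucket named by its prefix.
    """
    clusters = defaultdict(list)
    for z in values:
        key = f"{z:.{decimals}f}".replace(".", "")[:depth]
        clusters[key].append(z)
    return dict(clusters)


# Toy values, for illustration only.
sample = [0.0371, 0.0379, 0.1257, 0.1299, 0.4370]
for node, members in sorted(baire_clusters(sample).items()):
    print(node, members)
```

No distances between pairs of points are computed and no iteration is needed, which is the source of the linear running time.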

We recall that this happens because of the large number of iterations involved
in the case of k-means. Even when just one iteration is considered
for k-means (note that the algorithm does not converge in that case) the time
taken is more than double that of the Baire method (6.8 seconds versus
2.9 seconds).



8 Spectrometric and Photometric Digit Distribution

We have seen that the Baire ultrametric produces a strict hierarchical
classification. In the case of z_spec and z_phot this can be seen as follows.
Let us take any observed measurement, of either z_spec or z_phot; let us say
z_spec = z_phot = 0.1257. Here we have that for |K| = 4, z_spec = z_phot.
Hierarchically speaking, the root node is 0; at the first level there are
potentially 6 nodes (i.e. 0, 1, ..., 5); at the second level there are
potentially 60 nodes; and so on until k = |K| = 4, with z_spec = z_phot, where
there are potentially 6 · 10 · 10 · 10 = 6,000 nodes.

Of course not all nodes will be populated. In fact we can expect that a large
number of these potential nodes will be empty if the number of observations
n is lower than the potential number of nodes for a certain precision |K| (i.e.
n < 10^|K|). Note that this points to a big storage cost, but in practice the
tree is very sparsely populated and |K| is small.

A particular interpretation can be given in the case of an observed data
point. Following up the above example, if we take z_spec = z_phot = 0.1257, a
tree can be produced to store all observed data that falls within this node.
Doing this has many advantages from the viewpoint of storage. Access and
retrieval, for example, are very fast, and it is easy to retrieve all the
observations that fall within a given node and its children.
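
One way to picture such a tree is as a prefix tree keyed on successive digits. The following is a sketch under stated assumptions, not the data structure used in the study: the class `BaireNode` and the functions `insert` and `retrieve` are hypothetical names, and values are assumed to be stored to six decimal places.

```python
class BaireNode:
    """One node of a digit-prefix tree; children are keyed by the next digit."""

    def __init__(self):
        self.children = {}
        self.values = []  # observations stored at this node


def insert(root, z, depth=5, decimals=6):
    """Store z under the node addressed by its first `depth` digits."""
    digits = f"{z:.{decimals}f}".replace(".", "")[:depth]
    node = root
    for d in digits:
        node = node.children.setdefault(d, BaireNode())
    node.values.append(z)


def retrieve(root, prefix):
    """Return all observations under the node addressed by `prefix` and its children."""
    node = root
    for d in prefix:
        node = node.children.get(d)
        if node is None:
            return []  # unpopulated node: nothing is stored for this prefix
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        found.extend(current.values)
        stack.extend(current.children.values())
    return found


# Toy observations, for illustration only.
root = BaireNode()
for z in [0.1257, 0.1259, 0.1301, 0.4370]:
    insert(root, z)
print(sorted(retrieve(root, "012")))
```

Only populated nodes are ever created, so the sparsity noted above keeps the storage cost proportional to the data rather than to the number of potential nodes.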



With this tree it is a trivial task to build bins for data distribution. Figure 10 
depicts the frequency distribution for a given digit and precision. There are 100 
data points that have been convolved with a Gaussian kernel to produce surface 
planes in order to assemble three-dimensional plots. 

This helps to build a cluster-wise mapping of the data. Following the Figure 10
top panel we observe that for the first decimal digit most data observations
are concentrated in the digits 0, 1, 2, and 3. The rest of the decimal
precision data is uniformly distributed, gradually going towards zero when the
level of precision increases. The exception is two peaks, for precision
equal to 8. This turns out to be useful because when comparing the z_spec and
z_phot digit distributions we do not find the same peaks in z_phot. We can
therefore discriminate which observations are more reliable in
z_phot through different characteristics of the data associated with the peaks.
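
The binning behind such a digit distribution can be sketched as a count of each digit at each decimal position. This sketch covers only the counting step (the Gaussian smoothing used for the plots is not reproduced); the function name `digit_distribution` and the toy values are assumptions for the example.

```python
from collections import Counter


def digit_distribution(values, decimals=6):
    """For each decimal position, count how often each digit 0-9 occurs.

    Returns a list with one Counter per decimal position; position 0 is the
    first decimal digit. Values are assumed to have at most `decimals` places.
    """
    dist = [Counter() for _ in range(decimals)]
    for z in values:
        frac = f"{z:.{decimals}f}".split(".")[1]
        for position, digit in enumerate(frac):
            dist[position][int(digit)] += 1
    return dist


# Toy values, for illustration only.
dist = digit_distribution([0.12, 0.13, 0.25], decimals=2)
for position, counter in enumerate(dist, start=1):
    print(position, dict(sorted(counter.items())))
```

Computing this separately for the spectrometric and photometric values gives two tables of counts that can be compared position by position.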



[Figure 10 panels: Digit distribution z_spec (top); Digit distribution z_phot
(bottom).]

Figure 10: Digit distribution for z_spec and z_phot. Top: spectrometric digit
distribution; bottom: photometric digit distribution. Note that the digit
distribution for z_spec has three peaks, but z_phot has only one.
9 Concluding Remarks on the Astronomical Case Study and Other Applications

In the astronomy case clusters generated with the Baire distance can be useful 
when calibrating redshifts. In general, applying the Baire method to cases where 
digit precision is important can be of relevance, specifically to highlight data 
"bins" and some of their properties. 

Note that when two numbers share 3 prefix digits, and base 10 is used, we
have a Baire distance of d_B = 10^-3. We may not need to define the actual
(ultra)metric values. It may be, in fact, more convenient to work on the
hierarchy, with its different levels.
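
For readers who do want the numerical value, the convention just stated (k shared prefix digits in base 10 giving 10^-k) can be sketched as below. The function name `baire_distance` is hypothetical, values are assumed to be stored to six decimal places, and returning 0 for identical digit strings and 1 when not even the first digit is shared are assumptions made for this example.

```python
def baire_distance(a, b, decimals=6):
    """Base-10 Baire distance 10^-k, where k is the number of shared leading digits.

    Assumptions of this sketch: values have at most `decimals` decimal places,
    identical digit strings are at distance 0, and values sharing no leading
    digit are at distance 1 (k = 0).
    """
    x = f"{a:.{decimals}f}".replace(".", "")
    y = f"{b:.{decimals}f}".replace(".", "")
    if x == y:
        return 0.0
    k = 0
    for p, q in zip(x, y):
        if p != q:
            break
        k += 1
    return 10.0 ** -k


print(baire_distance(0.437, 0.439))  # three shared digits: "0", "4", "3"
print(baire_distance(0.437, 0.474))  # two shared digits: "0", "4"
```

Because the distance depends only on the length of the common prefix, it satisfies the strong triangle inequality d(a, c) <= max(d(a, b), d(b, c)), which is what makes it an ultrametric.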
In section 6.1 we showed how we could derive that 82.8% of values for z_spec
and z_phot have at least two common prefix digits. This is a powerful result in
practice when we recall that we can find very efficiently where these 82.8% of
the astronomical objects are.

Using the Baire distance we showed in section 8 that z_spec and z_phot signals
can be stored in a tree-like structure. This is advantageous when measuring the
digit distribution for each signal. When comparing these distributions, it can
easily be seen where the differences arise.

The Baire distance has proved very useful in a number of cases. For instance,
in [26] this distance is used in conjunction with random projection [31] as the
basis for clustering a large dataset of chemical compounds, achieving results
comparable to k-means but with better performance due to the lower computational
complexity of the Baire-based clustering method.

Other application areas include text mining and semantic preservation [27].
For more details refer to [5], where a number of examples are discussed.



10 Conclusions 

The Euclidean distance is appropriate for real-valued data. In this work we have
instead focused on an m-adic (m a non-negative integer) number representation.

In this work the distance called the Baire distance is presented. This dis- 
tance has been very recently introduced into data analysis. We show how this 
distance can be used to generate clusters in a way that is computationally in- 
expensive when compared with more traditional techniques. As an ultrametric, 
the distance directly induces a hierarchy. Hence the Baire distance lends itself 
very well to the new hierarchical clustering method that we have introduced 
here. 

We presented a case study in this article to motivate the approach, more 
particularly to show how it achieved comparable performance with respect to 
k-means, and finally to demonstrate how it greatly outperforms k-means (and 
a fortiori any traditional hierarchical clustering algorithm) computationally. 